In this coursework I will explore a dataset about students' performance in relation to their habits and lifestyle. I chose this dataset because it is rich in different discrete and continuous variables with a good range and, I assume, good variance, which will be very useful in the Data Science 2 class.
This dataset consists of 16 columns, which can be categorized into these groups:
- discrete variables (example: exercise frequency per week)
- continuous variables (example: sleep quantity)
- categorical variables (example: part-time job, yes or no)
This is a regression problem, because I will try to predict a student's exam score based on the other variables.
Here I would like to collect, prepare, and explore my data. The first step is to import the dataset.
dt_students <- fread(file = "./data/student_habits_performance.csv")
I would like to check whether I have any missing data in my dataset. A good approach is to go through all rows and columns and check for NA values. I will do this with the built-in R function complete.cases(data_table), which returns FALSE for every row that contains an NA value and TRUE otherwise.
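A minimal sketch of that check on a toy table (the same calls apply unchanged to dt_students):

```r
# Toy data.table with missing values; FALSE marks rows containing an NA
library(data.table)
dt_toy <- data.table(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(dt_toy)        # TRUE FALSE FALSE
sum(!complete.cases(dt_toy))  # number of rows with at least one NA: 2
```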
That looks great. Now we can move on to exploration. But before I start, it is crucial to load all the needed libraries.
library(data.table)
library(ggcorrplot)
library(ggExtra)
library(ggplot2)
library(ggridges)
library(ggsci)
library(ggthemes)
library(RColorBrewer)
library(svglite)
library(viridis)
library(scales)
library(rpart)
library(rpart.plot)
library(factoextra)
I found some helpful functions in R for taking a first look at the data. We will start with the structure, then get some summary statistics, and finally take a head() of the data.
str(dt_students)
Classes ‘data.table’ and 'data.frame': 1000 obs. of 16 variables:
$ student_id : chr "S1000" "S1001" "S1002" "S1003" ...
$ age : int 23 20 21 23 19 24 21 21 23 18 ...
$ gender : chr "Female" "Female" "Male" "Female" ...
$ study_hours_per_day : num 0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
$ social_media_hours : num 1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
$ netflix_hours : num 1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
$ part_time_job : chr "No" "No" "No" "No" ...
$ attendance_percentage : num 85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
$ sleep_hours : num 8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
$ diet_quality : chr "Fair" "Good" "Poor" "Poor" ...
$ exercise_frequency : int 6 6 1 4 3 1 2 0 3 5 ...
$ parental_education_level : chr "Master" "High School" "High School" "Master" ...
$ internet_quality : chr "Average" "Average" "Poor" "Good" ...
$ mental_health_rating : int 8 8 1 1 1 4 4 8 1 10 ...
$ extracurricular_participation: chr "Yes" "No" "No" "Yes" ...
$ exam_score : num 56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...
- attr(*, ".internal.selfref")=<externalptr>
Summary statistics:
summary(dt_students)
student_id age gender study_hours_per_day social_media_hours netflix_hours part_time_job attendance_percentage sleep_hours diet_quality exercise_frequency parental_education_level internet_quality mental_health_rating
Length:1000 Min. :17.00 Length:1000 Min. :0.00 Min. :0.000 Min. :0.000 Length:1000 Min. : 56.00 Min. : 3.20 Length:1000 Min. :0.000 Length:1000 Length:1000 Min. : 1.000
Class :character 1st Qu.:18.75 Class :character 1st Qu.:2.60 1st Qu.:1.700 1st Qu.:1.000 Class :character 1st Qu.: 78.00 1st Qu.: 5.60 Class :character 1st Qu.:1.000 Class :character Class :character 1st Qu.: 3.000
Mode :character Median :20.00 Mode :character Median :3.50 Median :2.500 Median :1.800 Mode :character Median : 84.40 Median : 6.50 Mode :character Median :3.000 Mode :character Mode :character Median : 5.000
Mean :20.50 Mean :3.55 Mean :2.506 Mean :1.820 Mean : 84.13 Mean : 6.47 Mean :3.042 Mean : 5.438
3rd Qu.:23.00 3rd Qu.:4.50 3rd Qu.:3.300 3rd Qu.:2.525 3rd Qu.: 91.03 3rd Qu.: 7.30 3rd Qu.:5.000 3rd Qu.: 8.000
Max. :24.00 Max. :8.30 Max. :7.200 Max. :5.400 Max. :100.00 Max. :10.00 Max. :6.000 Max. :10.000
extracurricular_participation exam_score
Length:1000 Min. : 18.40
Class :character 1st Qu.: 58.48
Mode :character Median : 70.50
Mean : 69.60
3rd Qu.: 81.33
Max. :100.00
And this is a sample of the dataset:
head(dt_students)
I would like to start with the density of the main variables which, based on my domain knowledge, are important for good performance at university.
Attendance density:
ggplot(data = dt_students, aes(x = attendance_percentage)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) +
labs(
x = 'Attendance percentage',
y = 'Density'
) +
geom_vline(xintercept = mean(dt_students$attendance_percentage), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
This density plot visualizes the distribution of attendance percentage, showing that most students attend roughly 85% of their classes. The distribution is slightly left-skewed, since attendance is capped at 100% and the tail stretches toward the lower values. The dashed vertical line marks the mean attendance percentage (~84%). Overall, the plot shows that the majority of students attend most of their classes.
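A quick, rough way to check the skew direction is to compare the mean and the median: a mean below the median usually indicates a left skew, a mean above it a right skew. A toy sketch:

```r
# Toy left-skewed sample: a long tail of low values below a high bulk
x <- c(40, 60, 80, 85, 86, 88, 90, 92, 95, 97)
mean(x) < median(x)  # TRUE, consistent with a left skew
```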
Area density:
ggplot(data = dt_houses, aes(x = area)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) +
labs(
x = 'Area',
y = 'Density'
) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
The area density plot looks similar to the price density plot, which makes sense: if a house has a larger area, a higher price is expected. The plot shows that most houses have an area in the range of roughly 3,000-5,000, while some properties exceed 12,000.
The next plot visualizes how price depends on area.
ggplot() +
geom_point(data = dt_houses, aes(x = area, y = price, color = parking)) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
This scatter plot visualizes the relationship between house area (x-axis) and price (y-axis), with color indicating the number of parking spaces. It shows a positive correlation between area and price: larger houses tend to be more expensive. However, there is some variability, as some large houses have relatively lower prices. The color gradient suggests that houses with more parking spaces (lighter blue) tend to be higher in price and larger in area.
The next plot is a boxplot with bedrooms as a factor variable on the x-axis and price on the y-axis, to get an overall understanding of how the number of bedrooms affects price.
ggplot(data = dt_houses, aes(x = factor(bedrooms), y = price)) +
geom_boxplot() +
theme_minimal()
The boxplot shows that, on average, houses with more bedrooms have higher prices, but around 4-6 bedrooms the first quartile stagnates, and so does the median price. There are some outliers, but not too many.
It is also interesting to look at the distribution of bedrooms, so the next plot is a histogram, because I want to know which number of bedrooms is the most "popular" in the whole dataset.
ggplot(data = dt_houses, aes(x = bedrooms)) +
geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) +
geom_vline(xintercept = mean(dt_houses$bedrooms), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
Mean number of bedrooms:
mean(dt_houses$bedrooms)
[1] 2.965138
From this visualization we can see that most of the houses have 2, 3, or 4 bedrooms; houses with 1, 5, or 6 bedrooms are much less common in this dataset.
Let's have a look at the histogram of stories:
ggplot(data = dt_houses, aes(x = stories)) +
geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) +
geom_vline(xintercept = mean(dt_houses$stories), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
mean(dt_houses$stories)
[1] 1.805505
This plot shows that the most common numbers of stories are 1 and 2; houses with 3 or 4 stories together make up fewer than 100.
Bathrooms are also an interesting variable, so let's take a look at a histogram of bathrooms and a boxplot of bathrooms versus price:
ggplot(data = dt_houses, aes(x = bathrooms)) +
geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) +
geom_vline(xintercept = mean(dt_houses$bathrooms), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
ggplot(data = dt_houses, aes(x = factor(bathrooms), y = price)) +
geom_boxplot() +
theme_minimal()
Here it is also almost obvious that more bathrooms push the price up. The only drawback is that my dataset does not contain enough data about properties with 3 or 4 bathrooms: there are a few with 3, but very few with 4.
Furnishing is also important: many people search for furnished apartments, but the furniture may not be in the best shape, or the buyer may not like the style. So in my opinion it is not as strong a predictor as, for example, area.
How many properties are furnished, semi-furnished, or unfurnished:
ggplot(data = dt_houses, aes(x = factor(furnishingstatus), fill = factor(furnishingstatus))) +
geom_bar(color="#ced4da", alpha=0.25) +
scale_fill_viridis_d(option = "D") +
labs(title = "Distribution of Furnishing Status",
x = "Furnishing Status",
y = "Count") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
We can see that most of the houses are semi-furnished, which is also logical: when selling a house or apartment, sellers usually take the most valuable things with them, furniture included.
Now it would be great to look at the price and area distribution across differently furnished properties.
ggplot(data = dt_houses, aes(y = price, x = area)) +
geom_point(data = dt_houses, aes(y = price, x = area, color = bedrooms)) +
geom_hline(yintercept = mean(dt_houses$price), linetype='dashed') +
facet_grid(.~furnishingstatus) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
scale_color_distiller(type = "seq", palette = "Greens") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
Also, you can notice that, on average, unfurnished houses are less expensive.
We can also take a look at some pie charts:
dt_mainroad_counts <- as.data.frame(table(dt_houses$mainroad)) #table() - creates frequency table
colnames(dt_mainroad_counts) <- c("mainroad_status", "count")
dt_mainroad_counts$percentage <- round(dt_mainroad_counts$count / sum(dt_mainroad_counts$count) * 100, 1)
ggplot(data = dt_mainroad_counts, aes(x = "", y = count, fill = mainroad_status)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(percentage, "%")),
position = position_stack(vjust = 0.5), color = "white", size = 4) +
theme_void() +
scale_fill_manual(values = c("#F1B147", "#47B1F1")) +
labs(
title = "Distribution of Mainroad Status",
fill = "Mainroad Status"
)
Almost 86 percent of houses are on a main road, so this may not be a strong predictor variable.
dt_airconditioning_counts <- as.data.frame(table(dt_houses$airconditioning)) #table() - creates frequency table
colnames(dt_airconditioning_counts) <- c("airconditioning_status", "count")
dt_airconditioning_counts$percentage <- round(dt_airconditioning_counts$count / sum(dt_airconditioning_counts$count) * 100, 1)
ggplot(data = dt_airconditioning_counts, aes(x = "", y = count, fill = airconditioning_status)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(percentage, "%")),
position = position_stack(vjust = 0.5), color = "white", size = 4) +
theme_void() +
scale_fill_manual(values = c("#F1B147", "#47B1F1")) +
labs(
title = "Distribution of Airconditioning status",
fill = "Airconditioning Status"
)
Here, 68.4 percent of houses have air conditioning, but I do not yet know how this will affect predictions.
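The two pie-chart blocks above share almost all of their code; a small helper (the name plot_share_pie is my own, hypothetical choice) could avoid the duplication for any yes/no column:

```r
library(ggplot2)

# Hypothetical helper to avoid repeating the pie-chart boilerplate:
# builds a frequency table for `column` and draws a labeled pie chart
plot_share_pie <- function(dt, column, title) {
  counts <- as.data.frame(table(dt[[column]]))
  colnames(counts) <- c("status", "count")
  counts$percentage <- round(counts$count / sum(counts$count) * 100, 1)
  ggplot(counts, aes(x = "", y = count, fill = status)) +
    geom_bar(stat = "identity", width = 1, color = "white") +
    coord_polar("y", start = 0) +
    geom_text(aes(label = paste0(percentage, "%")),
              position = position_stack(vjust = 0.5),
              color = "white", size = 4) +
    theme_void() +
    scale_fill_manual(values = c("#F1B147", "#47B1F1")) +
    labs(title = title, fill = column)
}

# e.g. plot_share_pie(dt_houses, "mainroad", "Distribution of Mainroad Status")
```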
I think that is enough exploration, and we can start with the models.
First, I would like to start simple, with a linear model.
I will include all variables in the model, because they all seem potentially important.
I will use the lm() function in R to estimate the beta coefficients and create my model.
price_lm <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea, data = dt_houses)
summary(price_lm)
Call:
lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + bathrooms +
guestroom + basement + prefarea, data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2619718 -657322 -68409 507176 5166695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42771.69 264313.31 0.162 0.871508
area 244.14 24.29 10.052 < 2e-16 ***
bedrooms 114787.56 72598.66 1.581 0.114445
hotwaterheatingyes 855447.15 223152.69 3.833 0.000141 ***
airconditioningyes 864958.31 108354.51 7.983 8.91e-15 ***
stories 450848.00 64168.93 7.026 6.55e-12 ***
mainroadyes 421272.59 142224.13 2.962 0.003193 **
parking 277107.10 58525.89 4.735 2.82e-06 ***
furnishingstatussemi-furnished -46344.62 116574.09 -0.398 0.691118
furnishingstatusunfurnished -411234.39 126210.56 -3.258 0.001192 **
bathrooms 987668.11 103361.98 9.555 < 2e-16 ***
guestroomyes 300525.86 131710.22 2.282 0.022901 *
basementyes 350106.90 110284.06 3.175 0.001587 **
prefareayes 651543.80 115682.34 5.632 2.89e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1068000 on 531 degrees of freedom
Multiple R-squared: 0.6818, Adjusted R-squared: 0.674
F-statistic: 87.52 on 13 and 531 DF, p-value: < 2.2e-16
We got an R-squared of 0.68, which is not bad for a first model. I will try to do better later, but first, another model.
I would like to measure the performance of my models with RMSE, so I will calculate the RMSE for the linear model.
price_lm_rmse <- sqrt(mean(price_lm$residuals^2))
price_lm_rmse
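For reference, RMSE is the square root of the mean squared error. A tiny self-contained helper (the name rmse is my own) makes the definition explicit:

```r
# RMSE = sqrt(mean of squared errors)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), about 1.155
```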
I think this model could perform better, because some variables may affect the target non-linearly, and in that case a tree model can show better performance.
In this coursework, rpart will be used to create the regression tree.
prices_tree <- rpart(data = dt_houses, formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea, method = 'anova')
prp(prices_tree, digits = -3)
printcp(prices_tree)
Regression tree:
rpart(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + bathrooms +
guestroom + basement + prefarea, data = dt_houses, method = "anova")
Variables actually used in tree construction:
[1] airconditioning area basement bathrooms furnishingstatus parking
Root node error: 1.9032e+15/545 = 3.4921e+12
n= 545
CP nsplit rel error xerror xstd
1 0.304946 0 1.00000 1.00308 0.085214
2 0.094553 1 0.69505 0.73722 0.064021
3 0.053743 2 0.60050 0.63705 0.055702
4 0.026381 3 0.54676 0.60469 0.052423
5 0.024922 4 0.52038 0.63046 0.056608
6 0.022993 5 0.49546 0.63335 0.056276
7 0.021374 6 0.47246 0.63383 0.056693
8 0.015261 7 0.45109 0.60699 0.055365
9 0.013952 8 0.43583 0.59339 0.055453
10 0.012386 9 0.42188 0.57479 0.054000
11 0.010000 10 0.40949 0.54864 0.050812
Now that I have built the tree model with the help of rpart, let us explore it:
prices_tree
n= 545
node), split, n, deviance, yval
* denotes terminal node
1) root 545 1.903208e+15 4766729
2) area< 5954 361 6.066751e+14 4029993
4) bathrooms< 1.5 293 3.297298e+14 3773561
8) area< 4016 174 1.437122e+14 3431227
16) furnishingstatus=unfurnished 78 4.036605e+13 2977962 *
17) furnishingstatus=furnished,semi-furnished 96 7.430067e+13 3799505 *
9) area>=4016 119 1.358098e+14 4274118 *
5) bathrooms>=1.5 68 1.746610e+14 5134912
10) airconditioning=no 44 7.024826e+13 4563682 *
11) airconditioning=yes 24 6.373358e+13 6182167 *
3) area>=5954 184 7.161564e+14 6212174
6) bathrooms< 1.5 108 2.869179e+14 5382579
12) airconditioning=no 65 1.170629e+14 4843569
24) basement=no 38 5.226335e+13 4304816 *
25) basement=yes 27 3.824662e+13 5601815 *
13) airconditioning=yes 43 1.224240e+14 6197360 *
7) bathrooms>=1.5 76 2.492851e+14 7391072
14) parking< 1.5 51 7.184700e+13 6859794 *
15) parking>=1.5 25 1.336772e+14 8474878
30) airconditioning=no 10 5.146311e+13 7285600 *
31) airconditioning=yes 15 5.864106e+13 9267729 *
We can see that the tree has 11 terminal nodes (the printed node IDs go up to 31, but those are binary positions, not a count), which seems reasonable for a dataset of this size.
Now it would be great to prune the tree, because I do not want it to overfit:
plotcp(prices_tree)
This plot shows the complexity parameter (cp) of the tree. We pick the cp with the lowest cross-validated error (xerror), so the pruned tree keeps only as many leaves as it needs and does not overfit the data.
prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp)
prp(model_tree, digits = -3)
After pruning the tree, let's calculate the RMSE for the tree model:
prices_tree_pred <- predict(model_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus", "guestroom", "basement", "prefarea")])
prices_tree_rmse <- sqrt(mean((dt_houses$price - prices_tree_pred)^2))
prices_tree_rmse
Before I start working with PCA, it is important to normalize the data, so that differences in measurement scale do not affect the PCs.
dt_pca <- data.table(scale(dt_houses[,c("area", "bedrooms", "bathrooms", "stories", "parking")]))
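What scale() does here, sketched on a toy matrix: it centers each column to mean 0 and divides by its standard deviation, so no variable dominates the PCs just because of its units.

```r
# Toy matrix standing in for the numeric house columns
m <- cbind(area = c(3000, 5000, 7000), bedrooms = c(2, 3, 4))
s <- scale(m)           # center each column to mean 0, scale to sd 1
round(colMeans(s), 10)  # 0 0
apply(s, 2, sd)         # 1 1
```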
Now I want to run my PCA with the help of prcomp and get the summary, to dive into the data.
dt_houses_pca <- prcomp(dt_pca)
summary(dt_houses_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 1.3878 1.0927 0.8112 0.8031 0.7597
Proportion of Variance 0.3852 0.2388 0.1316 0.1290 0.1154
Cumulative Proportion 0.3852 0.6240 0.7556 0.8846 1.0000
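The "Proportion of Variance" and "Cumulative Proportion" rows can be reproduced directly from the component standard deviations, since each PC's variance is sdev^2. A self-contained sketch on the built-in mtcars data:

```r
# Reproduce summary(pca)'s variance rows from sdev, on built-in mtcars
pca <- prcomp(mtcars[, c("mpg", "hp", "wt", "qsec")], scale. = TRUE)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)  # variance share of each PC
cumsum(prop_var)                          # cumulative proportion, ends at 1
```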
That looks interesting, but I would like to plot it, so that it is easier to understand.
fviz_eig(dt_houses_pca, addlabels = TRUE) +
theme(plot.title = element_text(hjust = 0.5))
From this plot it is clear that all PCs are useful and reducing dimensionality would not be beneficial, because the last three PCs each contribute approximately 12-13% of the variance, which is a fair amount in this case. So I will use all PCs: it is not always necessary to cut dimensionality; it depends on how many variables there are and how large a contribution each PC makes. In this case a pure "change of basis" may perform better.
Now that I have calculated the PCs, it is time to run the models with the new inputs.
First, we will start with the linear model.
pc_table <- dt_houses_pca$x
dt_houses_pc_table_for_lm <- data.frame(price = dt_houses$price, pc_table)
price_lm_pca <- lm(formula = price ~ ., data = dt_houses_pc_table_for_lm)
summary(price_lm_pca)
Call:
lm(formula = price ~ ., data = dt_houses_pc_table_for_lm)
Residuals:
Min 1Q Median 3Q Max
-3396744 -731825 -64056 601486 5651126
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4766729 53296 89.439 < 2e-16 ***
PC1 -950194 38439 -24.719 < 2e-16 ***
PC2 -295175 48820 -6.046 2.77e-09 ***
PC3 86943 65760 1.322 0.187
PC4 320544 66423 4.826 1.82e-06 ***
PC5 -296109 70223 -4.217 2.91e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1244000 on 539 degrees of freedom
Multiple R-squared: 0.5616, Adjusted R-squared: 0.5575
F-statistic: 138.1 on 5 and 539 DF, p-value: < 2.2e-16
price_lm_pca_rmse <- sqrt(mean(price_lm_pca$residuals^2))
price_lm_pca_rmse
After PCA the results are worse, and I think this is because the PCA was computed only on the five numeric variables, so the information carried by the binary variables, such as "mainroad" and "airconditioning", was left out of these models entirely.